Unsupervised learning of word separators with MDL

Authors

Aris Xanthos

François Bavaud

Abstract

This paper describes a novel algorithm for the unsupervised learning of word separators in raw text. The algorithm requires no language-specific knowledge regarding the text being processed. It relies solely on distributional properties of the text and uses the minimum description length (MDL) principle in order to partition characters into two subsets that correspond well with the traditional notion of letters and separators. The distinction between these types of characters emerges as an optimal solution to the problem of simultaneously compressing two elements: the lexicon that is obtained by tokenizing the text using the hypothesized separators, and the representation of the text under this lexicon. The performance of the proposed algorithm is evaluated on the basis of electronic text in English, French and German.
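The two-part objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the coding scheme (a fixed number of bits per character for lexicon entries, empirical token frequencies for the text), the treatment of separator runs as lexicon entries, and the greedy search are all assumptions made here for illustration, and the names `description_length` and `greedy_separators` are hypothetical.

```python
import math
from collections import Counter


def description_length(text, separators):
    """Two-part code length in bits of `text` for a hypothesised separator set.

    Assumed scheme (illustrative only): the text is cut into maximal runs of
    separator and non-separator characters, every distinct run is a lexicon
    entry, and the text is encoded as a sequence of references to entries.
    """
    if not text:
        return 0.0
    # Cut the text into maximal runs of characters from the same class.
    tokens = []
    run = text[0]
    for ch in text[1:]:
        if (ch in separators) == (run[-1] in separators):
            run += ch
        else:
            tokens.append(run)
            run = ch
    tokens.append(run)

    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost: each entry is spelled out character by character,
    # plus one end-of-entry symbol.
    bits_per_char = math.log2(len(set(text)) + 1)
    lexicon_bits = sum((len(t) + 1) * bits_per_char for t in counts)
    # Text cost: each token costs -log2 of its empirical probability.
    text_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + text_bits


def greedy_separators(text):
    """Greedily add the character that most reduces the description length."""
    separators = set()
    best = description_length(text, separators)
    improved = True
    while improved:
        improved = False
        best_ch = None
        for ch in sorted(set(text) - separators):
            cand = description_length(text, separators | {ch})
            if cand < best:
                best, best_ch, improved = cand, ch, True
        if improved:
            separators.add(best_ch)
    return separators, best


sample = "the cat the dog the cat the dog " * 20
print(description_length(sample, set()))   # no separators: one huge lexicon entry
print(description_length(sample, {" "}))   # space as separator: small lexicon
print(greedy_separators(sample))
```

On repetitive text, treating the space as a separator yields a much shorter total code than treating the whole text as a single token, which is the trade-off the abstract refers to. Note that this simplified cost is symmetric: swapping the roles of the two character classes gives the same value, so the sketch only finds a partition and cannot by itself say which side is the separators.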


Related resources

Unsupervised Word Induction Using Mdl Criterion

Unsupervised learning of units (phonemes, words, phrases, etc.) is important to the design of statistical speech and NLP systems. This paper presents a general source-coding framework for inducing words from natural language text without word boundaries. An efficient search algorithm is developed to optimize the minimum description length (MDL) induction criterion. Despite some seemingly over-s...


Can MDL Improve Unsupervised Chinese Word Segmentation?

It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower...


A Goodness Measure for Phrase Learning via Compression with the MDL Principle

This paper reports our ongoing research on unsupervised language learning via compression within the MDL paradigm. It formulates an empirical information-theoretical measure, description length gain, for evaluating the goodness of guessing a sequence of words (or characters) as a phrase (or a word), which can be calculated easily following classic information theory. The paper also presents a be...


Unsupervised Lexical Learning As Inductive Inference via Compression

This paper presents a learning-via-compression approach to unsupervised acquisition of word forms with no a priori knowledge. Following the basic ideas in Solomonoff’s theory of inductive inference and Rissanen’s MDL framework, the learning is formulated as a process of inferring regularities, in the form of string patterns (i.e., words), from a given set of data. A segmentation algorithm is de...


Fully Unsupervised Word Segmentation with BVE and MDL

Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to gene...



Publication date: 2010
